Dask reproject #845

Draft · wants to merge 7 commits into base: master

Conversation

@keflavich (Contributor)

WIP: trying to get dask reprojection working.

@keflavich (Contributor, Author)

This is mostly just copy-pasted, at barely-past-the-idea stage. @astrofrog, any chance you (or the aperio folks) could find some time to work on this?

Partly, I'm wondering whether we should push the "don't put everything in memory" part into reproject or leave it here in spectral-cube.

@astrofrog (Member)

FYI I am currently working on dask support in reproject - more soon!

@keflavich (Contributor, Author)

@astrofrog can you help me out here - do we need this in spectral-cube or can spectral-cube just call reproject and not worry about it?

@astrofrog (Member)

@keflavich - right now I've added the ability in reproject to use dask behind the scenes to operate over data chunks, but I haven't yet made it possible to pass in/get out dask arrays. It shouldn't be too difficult to do that though, so if you are willing to wait a little longer, then it will be possible to just pass the dask array to reproject and get a dask array back with the reprojected cube. I can try and prioritize that for the next couple of weeks.
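
In other words, the anticipated usage is something like this sketch (illustrative only: `reproject_interp` is the real function, but the dask-in/dask-out behaviour, and the `cube_data`/`wcs_in`/`wcs_out`/`target_shape` names, are assumptions about the in-progress work):

```python
from reproject import reproject_interp

# cube_data: a dask array backed by the on-disk cube (assumed)
# wcs_in / wcs_out: astropy WCS objects for the input cube and the target grid
reprojected, footprint = reproject_interp((cube_data, wcs_in), wcs_out,
                                          shape_out=target_shape)
# Once dask support lands, `reprojected` would itself be a lazy dask array,
# so nothing is computed until it is accessed.
```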

@astrofrog (Member)

Just curious, would you actually want the reprojection to be lazily done? That is, creating the new cube would be 'instantaneous' and reprojection would only happen as parts of the cube are accessed. Or would you want the reprojection to actually be computed in full?
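
(For concreteness, here is the lazy-versus-eager distinction in generic dask terms; this sketch is not the reproject API:)

```python
import dask.array as da

arr = da.random.random((1000, 1000), chunks=(256, 256))
lazy = arr * 2                    # 'instantaneous': only builds a task graph
part = lazy[:10, :10].compute()   # computes only the chunk(s) actually accessed
full = lazy.compute()             # forces the whole computation up front
```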

@keflavich (Contributor, Author) commented Mar 13, 2023

I'm not entirely sure yet - I'm working this out as I go.

The active use case is:

  • we have a gigantic mosaic of the Galactic Center, 27800 × 10200 pixels
  • we have cubes covering this whole zone with >1000 spectral pixels/cube, totaling ~1 TB for the final cube (see the size check below)
  • there are 40 independent cubes that we have to first convolve, then regrid onto that big cube's grid
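
A quick size check on the final cube, assuming 32-bit floats:

```python
>>> 27800 * 10200 * 1000 * 4 / 1e12  # nx * ny * nchan * 4 bytes (float32), in TB
1.13424
```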

#868 is my attempt to make this at least somewhat tractable.

Ideally, under the hood, we'd reproject onto the minimal overlap region... where should that step be done?

@astrofrog (Member) commented Mar 13, 2023

@keflavich - what do you mean by minimal overlap region? The mosaic WCS, or the subset of the mosaic WCS that overlaps with each cube?

@keflavich (Contributor, Author)

I mean the corner-finding that is done in coadd.py.

I'm refactoring to use coadd.py with some additions: astropy/reproject#351
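
For reference, the corner-finding amounts to projecting the input image's corners into the output frame and clipping to the pixel range they span; a rough sketch of the idea (this helper is hypothetical, not the actual coadd.py code):

```python
import numpy as np

def footprint_slices(wcs_in, shape_in, wcs_out):
    # Hypothetical helper: bounding box of the input image's footprint,
    # expressed as slices into the output (mosaic) pixel grid
    ny, nx = shape_in
    # pixel coordinates of the four corners of the input image
    corner_x = np.array([0, nx - 1, 0, nx - 1])
    corner_y = np.array([0, 0, ny - 1, ny - 1])
    # corners -> world coordinates -> output pixel coordinates
    world = wcs_in.pixel_to_world(corner_x, corner_y)
    out_x, out_y = wcs_out.world_to_pixel(world)
    xmin, xmax = int(np.floor(out_x.min())), int(np.ceil(out_x.max()))
    ymin, ymax = int(np.floor(out_y.min())), int(np.ceil(out_y.max()))
    # reproject only into this sub-region of the mosaic
    # (clipping to the mosaic bounds is omitted for brevity)
    return slice(ymin, ymax + 1), slice(xmin, xmax + 1)
```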

@astrofrog (Member)

> we have cubes covering this whole zone with >1000 pixels/cube, so totaling ~1 TB/cube

Do you mean they have >1000 pixels on each side? And do you really mean 1 TB/cube, or 1 GB/cube?

@keflavich (Contributor, Author)

The output cube target size is 27800 × 10200 × 1000 pixels. We will probably limit the third dimension to ~200 pixels in some cases.

@keflavich (Contributor, Author)

I revised the original definition of the cube sizes above in reply to #845 (comment)

@astrofrog (Member)

@keflavich OK, thanks! I have started work on supporting dask inputs to the reproject functions and will open a draft PR soon.

@astrofrog (Member)

However, feel free to also try on your side if you want, I will keep an eye on your PR :)

@keflavich (Contributor, Author)

I think we need both of these together, no? I'm adding 3D capability to reproject_and_coadd, and you're making it all daskable, right?

@keflavich (Contributor, Author)

Ah, no, you're talking about _this_ PR. Yeah, I'm not going to work on this while you are - I'm focused on the other support infrastructure for now.

@astrofrog (Member)

I did mean the reproject PR, not this one - yes, our work should end up being complementary! (I will just focus on the individual reproject functions, not coadd.)

@codecov (bot) commented Sep 9, 2023

Codecov Report

Patch coverage is 4.54% of modified lines.

❗ Current head 672245e differs from pull request most recent head df32208. Consider uploading reports for the commit df32208 to get more accurate results.

| Files Changed | Coverage |
| --- | --- |
| spectral_cube/dask_spectral_cube.py | 4.54% |


@keflavich (Contributor, Author)

@astrofrog I'm trying to test some of the upstream dask changes, but I need this PR as a frontend to those. I've run into a problem - the reproject_interp wrapper:

```python
def reproject_interp_wrapper(img_slice, **kwargs):
    # What exactly is the wrapper getting here?
    # I think it is given a _cube_ that is a cutout?
    # No, it is getting dask arrays (at least sometimes)
    if filled:
        data = img_slice.filled_data[:]
    else:
        data = img_slice._data
    return reproject_interp((data, img_slice.header),
```

is wrong - img_slice is (in my implementation above) assumed to be a cube subset. But instead it is an array.

So I think this PR is totally wrong, taking the wrong general approach - instead, perhaps we should just directly call reproject.reproject_interp without going through apply_parallel_spatial?

@astrofrog (Member) commented Sep 11, 2023

Yes, I think the intermediate wrappers predate some of the reproject improvements. We should be able to just call reproject with the full dask array and target WCS.
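
That is, roughly the following (a sketch that assumes the in-progress dask-input support in reproject; `cube` stands in for a DaskSpectralCube, and `target_wcs`/`target_shape` for the parsed target header):

```python
from reproject import reproject_interp

# cube._data is the cube's dask array; no per-slice wrapper is needed
reprojected, footprint = reproject_interp((cube._data, cube.wcs), target_wcs,
                                          shape_out=target_shape)
```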

@astrofrog (Member)

@keflavich - I've rebased and pushed an additional commit that I think makes the DaskSpectralCube.reproject method work how it should; that is, it doesn't actually do the reprojection, and instead passes the resulting dask array on to the new DaskSpectralCube (and I think a user can then choose to dump to tmp?). However, this currently fails with the stable version of reproject due to the tmp_dir bug that we discussed before, so I'm looking into that now.
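
From the user's side, the intended behaviour would then look roughly like this (a sketch of what is described above, not the final API; `target_header` is a FITS header defining the output grid):

```python
from spectral_cube import SpectralCube

cube = SpectralCube.read('input_cube.fits', use_dask=True)

# reproject returns immediately: the new cube wraps a lazy dask array,
# and no reprojection has actually been computed yet
new_cube = cube.reproject(target_header)

# computation happens only when the data are accessed, e.g. on write
new_cube.write('reprojected_cube.fits')
```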

@astrofrog (Member)

OK, so with astropy/reproject#390 I can get things to run. However, this is currently not optimal for two reasons:

It might be better to be pragmatic here and actually carry out the reprojection inside the reproject function rather than delaying it, at least by default, because it gives us more control over the points above. I'll try and work on this a little more shortly.

@keflavich (Contributor, Author)

Yes, I agree - the default should be the one least likely to cause harm on the user's computer (i.e., unlikely to fill up their hard drive with tmpdir junk, unlikely to just crash), and exceptional cases needed for performance or other reasons should be described in the docs.
